Diamonds Exploration by Chris Saden


Analysis

Load the Data

library(ggplot2)
data(diamonds)
theme_set(theme_minimal(20))

Summary of the Data Set

dim(diamonds)
## [1] 53940    10
names(diamonds)
##  [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
##  [8] "x"       "y"       "z"
str(diamonds)
## 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
summary(diamonds)
##      carat              cut        color        clarity     
##  Min.   :0.200   Fair     : 1610   D: 6775   SI1    :13065  
##  1st Qu.:0.400   Good     : 4906   E: 9797   VS2    :12258  
##  Median :0.700   Very Good:12082   F: 9542   SI2    : 9194  
##  Mean   :0.798   Premium  :13791   G:11292   VS1    : 8171  
##  3rd Qu.:1.040   Ideal    :21551   H: 8304   VVS2   : 5066  
##  Max.   :5.010                     I: 5422   VVS1   : 3655  
##                                    J: 2808   (Other): 2531  
##      depth          table          price             x        
##  Min.   :43.0   Min.   :43.0   Min.   :  326   Min.   : 0.00  
##  1st Qu.:61.0   1st Qu.:56.0   1st Qu.:  950   1st Qu.: 4.71  
##  Median :61.8   Median :57.0   Median : 2401   Median : 5.70  
##  Mean   :61.8   Mean   :57.5   Mean   : 3933   Mean   : 5.73  
##  3rd Qu.:62.5   3rd Qu.:59.0   3rd Qu.: 5324   3rd Qu.: 6.54  
##  Max.   :79.0   Max.   :95.0   Max.   :18823   Max.   :10.74  
##                                                               
##        y               z        
##  Min.   : 0.00   Min.   : 0.00  
##  1st Qu.: 4.72   1st Qu.: 2.91  
##  Median : 5.71   Median : 3.53  
##  Mean   : 5.73   Mean   : 3.54  
##  3rd Qu.: 6.54   3rd Qu.: 4.04  
##  Max.   :58.90   Max.   :31.80  
## 

Observations from the Summary

Most diamonds are of ideal cut. The median carat size is 0.7. Most diamonds have a color of G or better. About 75% of diamonds have carat weights less than 1. The median price for a diamonds $2401 and the max price is $18,823.

Understand the Distribution of Single Variables

I'm going to look at individual variables first to get a sense of the individual variables within the data set and make notes of things I want to continue to explore.

Price

qplot(price, data = diamonds)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk Price

qplot(price, data = diamonds, binwidth = 0.01) + scale_x_log10(breaks = c(100, 
    500, 1000, 1500, 2000, 5000, 10000, 15000))
## Warning: position_stack requires constant width: output may be incorrect

plot of chunk Price

Transformed the long tail data to better understand the distribution of price. The tranformed price distribution appears bimodal with the price peaking around 800 or so and again at 5000 or so. Why is there a gap at 1500? Are there really no diamonds with that price? I wonder what this plot looks like across the categorical variables of cut, color, and clarity.

Carat

qplot(carat, data = diamonds, binwidth = 0.1) + coord_cartesian(xlim = c(0, 
    2))

plot of chunk Carat

qplot(carat, data = diamonds, binwidth = 0.01) + coord_cartesian(xlim = c(0, 
    2))
## Warning: position_stack requires constant width: output may be incorrect

plot of chunk Carat

Some carat weights occur more often than other carat weights. I wonder how carat is connected to price, and I wonder if the carat values are specific to certain cuts of diamonds.

For now, I'm going to see which carat weights are most common.

sort(table(diamonds$carat), decreasing = T)
## 
##  0.3 0.31 1.01  0.7 0.32    1  0.9 0.41  0.4 0.71  0.5 0.33 0.51 0.34 1.02 
## 2604 2249 2242 1981 1840 1558 1485 1382 1299 1294 1258 1189 1127  910  883 
## 0.52 1.51  1.5 0.72 0.53 0.42 0.38 0.35  1.2 0.54 0.36 0.91 1.03 0.55 0.56 
##  817  807  793  764  709  706  670  667  645  625  572  570  523  496  492 
## 0.73 0.43 1.04 1.21 2.01 0.57 0.39 0.37 1.52 1.06 1.05 1.07 0.74 0.58 1.11 
##  492  488  475  473  440  430  398  394  381  373  361  342  322  310  308 
## 1.22 0.23 1.09  0.8 0.59 1.23  1.1    2 0.24 0.26 0.76 0.77 1.12 0.75 1.08 
##  300  293  287  284  282  279  278  265  254  253  251  251  251  249  246 
## 1.13 1.24 0.27  0.6 0.92 1.53  1.7 0.25 0.44 1.14 0.61 0.81 0.28 0.78 1.25 
##  246  236  233  228  226  220  215  212  212  207  204  200  198  187  187 
## 0.46 2.02 1.54 1.16 0.79 1.15 1.26 0.93 0.82 0.62 1.27 1.31 0.83 0.29 1.19 
##  178  177  174  172  155  149  146  142  140  135  134  133  131  130  126 
## 1.55 1.18  1.3 2.03 1.71 0.45 1.17 1.56 1.28 1.57 0.96 0.63 1.29 0.47  1.6 
##  124  123  122  122  119  110  110  109  106  106  103  102  101   99   95 
## 1.32 1.58 1.59 1.33 2.04 0.64 1.35 1.34 2.05 0.65 0.95 0.84 1.61 0.48 0.85 
##   89   89   89   87   86   80   77   68   67   65   65   64   64   63   62 
## 1.62 2.06 0.94 0.97 1.72 1.73  2.1 1.36  1.4 1.63 1.75 2.07 0.66 0.67 2.14 
##   61   60   59   59   57   52   52   50   50   50   50   50   48   48   48 
## 1.37 0.49 2.09 1.64 2.11 2.08 1.41 1.74 1.39 0.86 1.65  2.2 0.87 0.98 2.18 
##   46   45   45   43   43   41   40   40   36   34   32   32   31   31   31 
## 1.66 1.76 2.22 0.69 1.38 0.68 1.42 1.67 2.12 2.16 1.69 0.88 0.99 2.21 2.15 
##   30   28   27   26   26   25   25   25   25   25   24   23   23   23   22 
## 2.19 0.89 1.47  1.8 2.13  2.3 2.28 1.43 1.68 1.44 1.46 1.83 2.17 2.25 1.77 
##   22   21   21   21   21   21   20   19   19   18   18   18   18   18   17 
## 2.29  2.5 2.51 2.24 2.32 1.45 1.79 2.26 3.01 1.82 2.23 2.31  2.4  0.2 1.78 
##   17   17   17   16   16   15   15   15   14   13   13   13   13   12   12 
## 1.91 2.27 1.49 0.21 1.81 1.86 2.33 2.48 2.52 2.54 2.36 2.38 2.42 2.53    3 
##   12   12   11    9    9    9    9    9    9    9    8    8    8    8    8 
## 1.48 1.87  1.9 2.35 2.39 1.93 2.37 2.43 0.22 1.98 2.34 2.41 1.84 1.88 1.89 
##    7    7    7    7    7    6    6    6    5    5    5    5    4    4    4 
## 1.96 1.97 2.44 2.45 1.85 1.94 1.95 1.99 2.46 2.47 2.49 2.55 2.56 2.57 2.58 
##    4    4    4    4    3    3    3    3    3    3    3    3    3    3    3 
##  2.6 2.61 2.63 2.66 2.72 2.74 1.92 2.68 2.75  2.8 3.04 4.01 2.59 2.64 2.65 
##    3    3    3    3    3    3    2    2    2    2    2    2    1    1    1 
## 2.67  2.7 2.71 2.77 3.02 3.05 3.11 3.22 3.24  3.4  3.5 3.51 3.65 3.67    4 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
## 4.13  4.5 5.01 
##    1    1    1

Depth

qplot(depth, data = diamonds, binwidth = 0.1) + coord_cartesian(xlim = c(55, 
    70))
## Warning: position_stack requires constant width: output may be incorrect

plot of chunk Depth

summary(diamonds$depth)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    43.0    61.0    61.8    61.7    62.5    79.0

Most diamonds have a depth between 60 mm and 65 mm: median 61.8 mm and mean 61.75 mm.

Table

qplot(table, data = diamonds) + coord_cartesian(xlim = c(50, 70))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk Table

qplot(table, data = diamonds, binwidth = 0.1) + coord_cartesian(xlim = c(50, 
    70))
## Warning: position_stack requires constant width: output may be incorrect

plot of chunk Table

summary(diamonds$table)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    43.0    56.0    57.0    57.5    59.0    95.0

Setting the binwidth indicates that most table values are integers. Most diamonds have a table between 55 mm and 60 mm.

sort(table(diamonds$table), decreasing = T)
## 
##   56   57   58   59   55   60   54   61   62   63   53   64   65   66   52 
## 9881 9724 8369 6572 6268 4241 2594 2282 1273  588  567  260  146   91   56 
##   67 54.1 55.1 53.9 54.2 54.4 53.7 54.5 54.8 53.8 55.6 54.7   68 53.6 56.4 
##   42   30   30   28   28   28   25   24   24   22   22   21   21   20   20 
## 55.7 55.8 55.2 54.3 55.9 56.2 54.9 56.1 54.6 56.3 55.3 55.4 55.5 53.4 53.5 
##   19   19   18   17   17   17   16   16   15   15   13   13   13   12   12 
## 56.5 56.6 56.7 56.9 57.1 57.2 57.8 58.1 57.7 58.5 59.9 60.1 60.3 60.5   51 
##   11   11   11   11   11   11   11   11   10   10   10   10   10   10    9 
## 57.4 57.6 60.7   69   70 53.3 57.5 59.4 60.9 56.8 58.6 59.2 61.2 61.9 57.3 
##    9    9    9    9    9    8    8    8    8    7    7    7    7    7    6 
## 57.9 59.7 59.8 61.5 53.2 58.8 59.1 60.2 60.4 60.8 58.2 58.4 59.5 62.2 62.5 
##    6    6    6    6    5    5    5    5    5    5    4    4    4    4    4 
##   73 53.1 58.3 58.9 59.6 60.6 61.4   49   50 52.8 58.7 59.3 61.1 61.7 62.3 
##    4    3    3    3    3    3    3    2    2    2    2    2    2    2    2 
## 62.6 62.8   43   44 50.1 51.6 52.4 61.3 61.6 61.8 62.1 62.4 63.3 63.4 63.5 
##    2    2    1    1    1    1    1    1    1    1    1    1    1    1    1 
## 64.2 64.3 65.4   71   76   79   95 
##    1    1    1    1    1    1    1

Again, I wonder if this has anything to do with the cut of a diamond. Cut is the quality of a diamons may influence carat weight and is responsible for making a diamond sparkle. There's likely to be strong relationships among carat, table, cut, and price.

x dimension

qplot(x, data = diamonds, binwidth = 0.1)

plot of chunk X_Dim

qplot(x, data = diamonds, geom = "freqpoly", binwidth = 0.1)

plot of chunk X_Dim

Most diamonds have an x dimension between 4 mm and 7 mm.

y dimension

qplot(y, data = diamonds, binwidth = 0.1)
## Warning: position_stack requires constant width: output may be incorrect

plot of chunk Y_Dim

qplot(y, data = diamonds, binwidth = 0.1) + coord_cartesian(xlim = c(0, 10))
## Warning: position_stack requires constant width: output may be incorrect

plot of chunk Y_Dim

Again, most diamonds have a y dimension between 4 mm and 7 mm. There are some outliers for the y dimension.

z dimension

qplot(z, data = diamonds, binwidth = 0.1)
## Warning: position_stack requires constant width: output may be incorrect

plot of chunk Z_Dim

qplot(z, data = diamonds, binwidth = 0.1) + coord_cartesian(xlim = c(0, 7))
## Warning: position_stack requires constant width: output may be incorrect

plot of chunk Z_Dim

Most diamonds have a z dimension between 2 mm and 6 mm. There are some outliers for the z dimension too.

Carat Histogram by Cut

qplot(carat, data = diamonds, binwidth = 0.01, fill = cut) + coord_cartesian(xlim = c(0, 
    2)) + guides(fill = guide_legend(reverse = T))
## Warning: position_stack requires constant width: output may be incorrect

plot of chunk Carat_Hist_by_Cut

by(carat, cut, summary)
## Error: object 'carat' not found

It doesn't look like particular cuts have a certain number of carats. It looks like most of the ideal cut diamonds are less than one carat. I'm going to look at those values to be sure.

Ideal Cut Diamonds' Carat Values Sorted

sort(table(subset(diamonds, cut == "Ideal")$carat), decreasing = T)
## 
##  0.3 0.31 0.32 0.33 0.41  0.7  0.4 0.51 0.34 0.71 0.52 0.53 1.01  0.5 0.42 
## 1247 1209 1066  673  667  560  545  525  508  499  459  429  426  388  387 
## 0.38 0.35 0.54 0.72 0.56 0.36 0.55 1.02 0.73 0.57 1.03 0.43  0.9    1  1.2 
##  378  374  372  366  313  309  302  272  243  239  225  221  215  208  194 
## 1.04 0.58 1.51 0.39 1.06 0.37 1.21 1.07 1.05 1.09 0.59 0.74 1.11 1.23 0.76 
##  187  182  182  181  174  172  160  156  147  141  139  135  130  127  122 
## 1.08 1.22  1.1  1.5 0.27 1.13 1.52  0.6  0.8 0.91 0.26 0.61 1.12 0.44 1.24 
##  121  121  118  117  113  113  109  108  108  108  106  105  103   96   96 
## 0.77 0.75 1.14 0.81 1.16 0.46 0.78 0.28 2.01 1.25 1.53 0.24 1.26 0.25 0.79 
##   92   90   88   87   87   83   82   81   78   77   74   69   69   66   66 
## 0.82 0.62 0.83  1.7 1.15 1.17 1.27 1.55 0.92 1.54 0.29 1.18 1.31 0.63 1.19 
##   60   59   59   59   56   56   56   56   55   53   52   51   51   49   49 
## 1.28 1.57 2.02 1.56 0.23 0.45 0.47 1.58 1.29  1.6 0.64    2  1.3 2.03 1.71 
##   46   46   46   45   44   42   42   42   40   40   39   39   38   38   37 
## 0.93 1.59 1.32 0.85 1.34 1.33 1.35 0.65 1.61 1.62 0.48 1.63 0.66 0.84 1.37 
##   35   35   33   31   30   28   28   27   26   25   23   23   21   21   19 
## 1.75 2.07 0.96 0.97 1.36 2.05 2.04 2.06 0.87 1.74 0.95 1.65 1.67  2.1 1.39 
##   18   18   17   17   17   17   16   16   15   15   14   14   14   14   13 
##  1.4 1.64 0.86 2.08 2.09 2.14  2.2 0.94 1.38 1.66 1.68  1.8 2.16  2.3 0.88 
##   13   13   12   12   12   12   12   11   11   11   11   11   11   11   10 
## 1.41 1.72 1.76 2.11 2.15 1.42 1.69 0.67 1.43 2.12 2.18 2.22 1.73 2.24  2.4 
##   10   10   10   10   10    9    9    8    8    8    8    8    7    7    7 
## 0.49 0.89 2.13 2.19 2.21 2.28 2.36 0.69 0.98 0.99 2.17 2.25 2.26 2.32 1.77 
##    6    6    6    6    6    6    6    5    5    5    5    5    5    5    4 
## 1.79 2.37  2.5  0.2 0.68 1.44 1.46 1.49 1.83 1.87 1.91 2.27 2.29 2.51 2.54 
##    4    4    4    3    3    3    3    3    3    3    3    3    3    3    3 
## 1.78 1.85 1.89  1.9 1.98 2.33 2.39 2.45 2.46 2.53 2.61 2.72 3.01 1.45 1.47 
##    2    2    2    2    2    2    2    2    2    2    2    2    2    1    1 
## 1.48 1.82 1.84 1.86 1.92 1.93 2.34 2.41 2.42 2.43 2.47 2.48 2.49 2.52 2.56 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
## 2.59  2.6 2.63 2.64 2.75 3.22  3.5 
##    1    1    1    1    1    1    1

Most ideal cut diamonds are under 1.25 carats.

Diamond Counts by Cut

qplot(cut, data = diamonds, geom = "bar", fill = cut)

plot of chunk Diamond_Counts_by_Cut

Most diamonds have ideal cut, which is almost double the amount of very good cut diamonds.

Price by Cut

qplot(x = cut, y = price, data = diamonds, geom = "boxplot")

plot of chunk Price_by_Cut

by(diamonds$price, diamonds$cut, summary)
## diamonds$cut: Fair
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     337    2050    3280    4360    5210   18600 
## -------------------------------------------------------- 
## diamonds$cut: Good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     327    1140    3050    3930    5030   18800 
## -------------------------------------------------------- 
## diamonds$cut: Very Good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     336     912    2650    3980    5370   18800 
## -------------------------------------------------------- 
## diamonds$cut: Premium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326    1050    3180    4580    6300   18800 
## -------------------------------------------------------- 
## diamonds$cut: Ideal
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326     878    1810    3460    4680   18800

Ideal diamonds have the lowest median price. This seems really unusual since I would expect diamonds with an ideal cut to have a higher median price compared to the other groups. There are many outliers. The variation in price tends to increase as cut improves and then decreases for diamonds with ideal cuts. What about price/carat for these cuts?

Price per Carat by Cut

qplot(x = cut, y = price/carat, data = diamonds, geom = "boxplot")

plot of chunk Price_Per_Carat_Cut

qplot(x = cut, y = price/carat, data = diamonds, geom = "boxplot") + coord_cartesian(ylim = c(0, 
    6000))

plot of chunk Price_Per_Carat_Cut

with(diamonds, by(price/carat, cut, summary))
## cut: Fair
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1170    2740    3450    3770    4510   10900 
## -------------------------------------------------------- 
## cut: Good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1080    2390    3610    3860    4790   15900 
## -------------------------------------------------------- 
## cut: Very Good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1140    2330    3610    4010    5020   17800 
## -------------------------------------------------------- 
## cut: Premium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1050    2590    3760    4220    5320   17100 
## -------------------------------------------------------- 
## cut: Ideal
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1110    2460    3310    3920    4770   17100

Wow! Ideal diamonds have the lowest median for price per carat. The variance across the groups seems to be about the same with Fair cut diamonds having the least variation for the middle 50% of diamonds.

Diamond Counts by Color

qplot(color, data = diamonds, geom = "bar", fill = color)

plot of chunk Diamond_Counts_by_Color

Most diamonds have have color ratings between E and H.

Price by Color

qplot(x = color, y = price, data = diamonds, geom = "boxplot")

plot of chunk Price_by_Color

by(diamonds$price, diamonds$color, summary)
## diamonds$color: D
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     357     911    1840    3170    4210   18700 
## -------------------------------------------------------- 
## diamonds$color: E
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326     882    1740    3080    4000   18700 
## -------------------------------------------------------- 
## diamonds$color: F
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     342     982    2340    3720    4870   18800 
## -------------------------------------------------------- 
## diamonds$color: G
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     354     931    2240    4000    6050   18800 
## -------------------------------------------------------- 
## diamonds$color: H
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     337     984    3460    4490    5980   18800 
## -------------------------------------------------------- 
## diamonds$color: I
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1120    3730    5090    7200   18800 
## -------------------------------------------------------- 
## diamonds$color: J
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     335    1860    4230    5320    7700   18700

Here is another surprise. The lowest median price diamonds have a color of D, which is the best color in the data set. Price variance increases as the color decreases (best color is D and the worst color is J). The median price typically decreases as color improves. Let's look at price per carat by color.

Price per Carat by Color

qplot(x = color, y = price/carat, data = diamonds, geom = "boxplot")

plot of chunk Price_Per_Carat_Color

with(diamonds, by(price/carat, color, summary))
## color: D
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1130    2460    3410    3950    4750   17800 
## -------------------------------------------------------- 
## color: E
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1080    2430    3250    3800    4510   14600 
## -------------------------------------------------------- 
## color: F
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1170    2590    3490    4130    4950   13900 
## -------------------------------------------------------- 
## color: G
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1140    2540    3490    4160    5500   12500 
## -------------------------------------------------------- 
## color: H
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1050    2400    3820    4010    5130   10200 
## -------------------------------------------------------- 
## color: I
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1150    2340    3780    4000    5200    9400 
## -------------------------------------------------------- 
## color: J
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1080    2560    3780    3830    4930    8650

The best color diamonds (D and E) have the lowest median price. Again, this is such an unusual trend. This also seems strange since most diamonds in the data set are not of color D. Let's split up the price / carat distribution by color.

Price per Carat Hist by Color

qplot(x = price/carat, data = diamonds, fill = color)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk Price_Per_Carat_Hist_Color

qplot(x = price/carat, data = diamonds, fill = color) + facet_wrap(~cut)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk Price_Per_Carat_Hist_Color

It looks like the diamonds with better cuts and color tend to have lower price / carat values. This provides some explanation for the odd low median price and price / carat for better cuts and colors, but I'm still not clear on this. I'm going to keep this in mind and try to explore the same plots for clarity.

Diamond Counts by Clarity

qplot(clarity, data = diamonds, geom = "bar", fill = clarity)

plot of chunk Diamond_Counts_by_Clarity

Most diamonds have average clarity ratings. Very few diamonds have the worst or best clarity rating, like the rating pattern for color.

Price by Clarity

qplot(x = clarity, y = price, data = diamonds, geom = "boxplot")

plot of chunk Price_by_Clarity

by(diamonds$price, diamonds$clarity, summary)
## diamonds$clarity: I1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     345    2080    3340    3920    5160   18500 
## -------------------------------------------------------- 
## diamonds$clarity: SI2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326    2260    4070    5060    5780   18800 
## -------------------------------------------------------- 
## diamonds$clarity: SI1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326    1090    2820    4000    5250   18800 
## -------------------------------------------------------- 
## diamonds$clarity: VS2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334     900    2050    3920    6020   18800 
## -------------------------------------------------------- 
## diamonds$clarity: VS1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     327     876    2000    3840    6020   18800 
## -------------------------------------------------------- 
## diamonds$clarity: VVS2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     336     794    1310    3280    3640   18800 
## -------------------------------------------------------- 
## diamonds$clarity: VVS1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     336     816    1090    2520    2380   18800 
## -------------------------------------------------------- 
## diamonds$clarity: IF
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     369     895    1080    2860    2390   18800

Here again, there is a trend that goes against my intuition. The lowest median price occurs for the best clarity (IF). There also to be many more outliers for the better clarity diamonds. I'm not sure why great clarity diamonds are price so low. Another trend to note here is that price variance increases then decreases significantly as the clarity improves.

I want to look at two things: price per clarity, and the distribution of prices for diamonds with best levels of the categorical variables.

Price per Carat by Clarity

qplot(x = clarity, y = price/carat, data = diamonds, geom = "boxplot")

plot of chunk Price_Per_Carat_Clarity

with(diamonds, by(price/carat, clarity, summary))
## clarity: I1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1050    2110    2890    2800    3350    6350 
## -------------------------------------------------------- 
## clarity: SI2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1080    3000    3950    4010    4740    9910 
## -------------------------------------------------------- 
## clarity: SI1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1130    2360    3670    3850    4930    9690 
## -------------------------------------------------------- 
## clarity: VS2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1150    2440    3430    4080    5480   12500 
## -------------------------------------------------------- 
## clarity: VS1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1210    2410    3450    4160    5490   12400 
## -------------------------------------------------------- 
## clarity: VVS2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1340    2450    3170    4200    4940   13400 
## -------------------------------------------------------- 
## clarity: VVS1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1400    2540    2980    3850    4060   14500 
## -------------------------------------------------------- 
## clarity: IF
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1590    2860    3160    4260    4280   17800

This plot seems more reasonable. The lowest median price per carat has clarity I1 which is the lowest clarity rating. The median increases slightly then holds relatively constant before decreasing again for the highest clarity. The variance increases then decreases across the clarity levels from worst to best.

Price per Carat Hist by Clarity

qplot(x = price/carat, data = diamonds, fill = clarity) + guides(fill = guide_legend(reverse = T))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk Price_Per_Carat_Hist_Clarity

qplot(x = price/carat, data = diamonds, fill = clarity) + facet_wrap(~cut)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk Price_Per_Carat_Hist_Clarity

qplot(x = price/carat, data = diamonds, fill = clarity) + facet_wrap(~color)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk Price_Per_Carat_Hist_Clarity

The histogram and faceted histograms somewhat explain the odd trends as again there is a greater number of ideal diamonds, color D diamonds, and clarity IF diamonds in the lower price ranges. Let's look at the price distribution of the higher quality diamonds in cut, color, and clarity.

Price Hist Best Diamonds

bestDiamonds <- subset(diamonds, (color == "D" | color == "E") & (clarity == 
    "IF" | clarity == "VVS1") & (cut == "Ideal" | cut == "Premium"))
qplot(x = price, data = bestDiamonds)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk Price_Hist_Best_Diamonds

summary(bestDiamonds$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     553     967    1210    2890    2640   18700
summary(bestDiamonds$price/bestDiamonds$carat)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2170    2980    3420    4710    5020   17100

Let's compare to the worst diamonds across the same variables.

Price Hist Worst Diamonds

worstDiamonds <- subset(diamonds, (color == "J" | color == "I") & (clarity == 
    "I1" | clarity == "SI2") & (cut == "Fair" | cut == "Good"))
qplot(x = price, data = worstDiamonds)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

plot of chunk Price_Hist_Worst_Diamonds

summary(worstDiamonds$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     335    2810    4310    5750    7560   18500
summary(worstDiamonds$price/worstDiamonds$carat)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1080    2640    3320    3580    4280    7440

This doesn't add much to my thoughts already. I'm going create density plots that are similar to the price histograms earlier to examine the price for each level of cut, color, and clarity.

Price Histograms by Qualitative Factors

qplot(price, data = diamonds, binwidth = 0.01, color = cut, geom = "density") + 
    scale_x_log10(breaks = c(100, 500, 800, 1500, 5000, 10000))

plot of chunk Price_Histograms_by_Qualitative_Factors

qplot(price, data = diamonds, binwidth = 0.01, color = color, geom = "density") + 
    scale_x_log10(breaks = c(100, 500, 800, 1500, 5000, 10000))

plot of chunk Price_Histograms_by_Qualitative_Factors

qplot(price, data = diamonds, binwidth = 0.01, color = clarity, geom = "density") + 
    scale_x_log10(breaks = c(100, 500, 800, 1500, 5000, 10000)) + guides(color = guide_legend(reverse = T))

plot of chunk Price_Histograms_by_Qualitative_Factors

These density plots explain the odd trends that were seen in the box plots earlier. Diamonds with better levels of clarity, cut, and color tend to occur more often at lower prices while diamonds with worse levels of clarity, cut, and color tend to occur more often at higher prices. I am wondering about price / carat too.

Price per Carat Density Plots

qplot(price/carat, data = diamonds, binwidth = 0.01, color = cut, geom = "density") + 
    scale_x_log10(breaks = c(1000, 2000, 3500, 5000, 10000))

plot of chunk Price_per_Carat_Density

qplot(price/carat, data = diamonds, binwidth = 0.01, color = color, geom = "density") + 
    scale_x_log10(breaks = c(1000, 2000, 3500, 5000, 10000))

plot of chunk Price_per_Carat_Density

qplot(price/carat, data = diamonds, binwidth = 0.01, color = clarity, geom = "density") + 
    scale_x_log10(breaks = c(1000, 2000, 3000, 4500, 7000, 10000)) + guides(color = guide_legend(reverse = T))

plot of chunk Price_per_Carat_Density

These plots support the variability and trends that the boxplots showed from before. I am going see which variables correlate with price and try to work towards building a linear model to predict price.

Produce a simple correlation matrix for variable pairs.

library(lsr)
correlate(diamonds)
## 
## CORRELATIONS
## ============
## - correlation type:  pearson 
## - correlations shown only when both variables are numeric
## 
##         carat cut color clarity  depth  table  price      x      y     z
## carat       .   .     .       .  0.028  0.182  0.922  0.975  0.952 0.953
## cut         .   .     .       .      .      .      .      .      .     .
## color       .   .     .       .      .      .      .      .      .     .
## clarity     .   .     .       .      .      .      .      .      .     .
## depth   0.028   .     .       .      . -0.296 -0.011 -0.025 -0.029 0.095
## table   0.182   .     .       . -0.296      .  0.127  0.195  0.184 0.151
## price   0.922   .     .       . -0.011  0.127      .  0.884  0.865 0.861
## x       0.975   .     .       . -0.025  0.195  0.884      .  0.975 0.971
## y       0.952   .     .       . -0.029  0.184  0.865  0.975      . 0.952
## z       0.953   .     .       .  0.095  0.151  0.861  0.971  0.952     .

The dimensions of a diamond tend to correlate with each other. The longer one dimension, then the larger the diamond. The dimensions also correlate with carat weight which makes sense. Price correlates strongly with carat weight and the three dimensions (x, y, z).

ggpairs: sample diamonds

library(GGally)
## Loading required package: reshape
## Loading required package: plyr
## 
## Attaching package: 'reshape'
## 
## The following objects are masked from 'package:plyr':
## 
##     rename, round_any
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
## Loading required namespace: car
## 
## Attaching package: 'memisc'
## 
## The following object is masked from 'package:reshape':
## 
##     rename
## 
## The following object is masked from 'package:plyr':
## 
##     rename
## 
## The following objects are masked from 'package:stats':
## 
##     contr.sum, contr.treatment, contrasts
## 
## The following object is masked from 'package:base':
## 
##     as.array

# sample 5,000 diamonds from the data set
set.seed(281)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 5000), ]
ggpairs(diamond_samp, params = c(shape = I("."), outlier.shape = I(".")))

plot of chunk ggpairs

I want to closer at scatter plots involving price and some other variables: carat, table, depth, and volume.

Price vs Carat

qplot(x = carat, y = price, data = diamonds)

plot of chunk Prie_vs_Carat

qplot(x = carat, y = price, data = diamonds) + coord_cartesian(xlim = c(0, quantile(diamonds$carat, 
    0.99)))

plot of chunk Prie_vs_Carat

As carat size increases, the variance in price increases. We still see vertical bands where many diamonds take on the same carat value at different price points. The relationship between price and carat appears to be exponential rather than linear.

Price vs Table

qplot(x = table, y = price, data = diamonds)

plot of chunk Price_vs_Table

Again, the tall vertical strips indicate table values are mostly integers. A few outliers below 50 mm and one above 90 mm.

Price vs Table by Cut

qplot(x = table, y = price, data = diamonds, color = cut)

plot of chunk Price_vs_Table_by_Cut

qplot(x = table, y = price, data = diamonds, color = cut) + facet_wrap(~clarity)

plot of chunk Price_vs_Table_by_Cut

qplot(x = table, y = price, data = diamonds, color = cut) + facet_wrap(~color)

plot of chunk Price_vs_Table_by_Cut

Levels of cut cluster by table value. This may make sense based on the type of cut as certain cuts produce certain dimensions. The pattern holds across each level of clarity and each level of color with the exception of the lowest clarity.

Price vs. Table by Color

qplot(x = table, y = price, data = diamonds, color = color)

plot of chunk Price_vs_Table_by_Color

Nothing stands out in the plot above.

Price vs Table by Clarity

qplot(x = table, y = price, data = diamonds, color = clarity)

plot of chunk Price_vs_Table_by_Clarity

Nothing stands out in the plot above.

Prive vs Depth

qplot(x = depth, y = price, data = diamonds)

plot of chunk Price_vs_Depth

ggplot(aes(x = depth, y = price), data = diamonds) + geom_point(alpha = 1/50)

plot of chunk Price_vs_Depth

First plot suffers from overplotting. Most diamonds have a depth between 60 and 65 (no units).

Create a volume variable in the data frame.

What about the volume of diamonds? Does it have any relationships with price and other variables in the data set? I'm going to use a rough approximation of volume by using x * y * z to approximate a diamond as if it were a rectangular prism, basically a box.

diamonds <- transform(diamonds, volume = x * y * z)

Price vs Volume

qplot(x = volume, y = price, data = diamonds)

plot of chunk Price_vs_Volume

Some diamonds have a volume of 0. Why? There's other outliers: expensive diamond near volume of 4000 and less expensive diamond priced below 1000.

How many diamonds have a volume of 0?

table(diamonds$volume == 0)
## 
## FALSE  TRUE 
## 53920    20

Look at subset of diamonds with volume of 0

noVolume <- subset(diamonds, volume == 0)
noVolume
##       carat       cut color clarity depth table price    x    y z volume
## 2208   1.00   Premium     G     SI2  59.1    59  3142 6.55 6.48 0      0
## 2315   1.01   Premium     H      I1  58.1    59  3167 6.66 6.60 0      0
## 4792   1.10   Premium     G     SI2  63.0    59  3696 6.50 6.47 0      0
## 5472   1.01   Premium     F     SI2  59.2    58  3837 6.50 6.47 0      0
## 10168  1.50      Good     G      I1  64.0    61  4731 7.15 7.04 0      0
## 11183  1.07     Ideal     F     SI2  61.6    56  4954 0.00 6.62 0      0
## 11964  1.00 Very Good     H     VS2  63.3    53  5139 0.00 0.00 0      0
## 13602  1.15     Ideal     G     VS2  59.2    56  5564 6.88 6.83 0      0
## 15952  1.14      Fair     G     VS1  57.5    67  6381 0.00 0.00 0      0
## 24395  2.18   Premium     H     SI2  59.4    61 12631 8.49 8.45 0      0
## 24521  1.56     Ideal     G     VS2  62.2    54 12800 0.00 0.00 0      0
## 26124  2.25   Premium     I     SI1  61.3    58 15397 8.52 8.42 0      0
## 26244  1.20   Premium     D    VVS1  62.1    59 15686 0.00 0.00 0      0
## 27113  2.20   Premium     H     SI1  61.2    59 17265 8.42 8.37 0      0
## 27430  2.25   Premium     H     SI2  62.8    59 18034 0.00 0.00 0      0
## 27504  2.02   Premium     H     VS2  62.7    53 18207 8.02 7.95 0      0
## 27740  2.80      Good     G     SI2  63.8    58 18788 8.90 8.85 0      0
## 49557  0.71      Good     F     SI2  64.1    60  2130 0.00 0.00 0      0
## 49558  0.71      Good     F     SI2  64.1    60  2130 0.00 0.00 0      0
## 51507  1.12   Premium     G      I1  60.4    59  2383 6.71 6.67 0      0
summary(noVolume$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2130    3560    5350    8800   15500   18800
summary(diamonds$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326     950    2400    3930    5320   18800
table(diamonds$carat == 0)
## 
## FALSE 
## 53940

All dimensions (x, y, and z) are missing or the z value is 0. The diamonds in this subset tend to be very expensive or fall in the third quartile of the entire diamonds data set. Other variables such as carat, depth, table, and price are reported so I'll assume those values can be trusted.

Price vs Volume Subset with Linear Model

quantile(diamonds$volume, probs = seq(0, 1, 0.01))
##      0%      1%      2%      3%      4%      5%      6%      7%      8% 
##    0.00   40.08   43.94   47.69   48.66   49.18   49.61   49.97   50.32 
##      9%     10%     11%     12%     13%     14%     15%     16%     17% 
##   50.69   51.09   51.54   51.93   52.40   52.85   53.38   53.99   54.73 
##     18%     19%     20%     21%     22%     23%     24%     25%     26% 
##   55.55   56.60   57.80   59.16   60.89   62.65   63.94   65.14   66.00 
##     27%     28%     29%     30%     31%     32%     33%     34%     35% 
##   66.67   67.27   68.06   69.01   70.54   73.70   79.72   81.55   82.57 
##     36%     37%     38%     39%     40%     41%     42%     43%     44% 
##   83.46   84.46   85.52   86.71   87.97   89.47   90.99   92.80   94.98 
##     45%     46%     47%     48%     49%     50%     51%     52%     53% 
##   98.02  103.00  110.53  112.63  113.86  114.81  115.69  116.59  117.50 
##     54%     55%     56%     57%     58%     59%     60%     61%     62% 
##  118.75  120.43  123.04  126.36  130.20  135.45  141.72  144.29  146.28 
##     63%     64%     65%     66%     67%     68%     69%     70%     71% 
##  148.14  151.06  156.45  159.09  160.63  161.88  162.83  163.74  164.85 
##     72%     73%     74%     75%     76%     77%     78%     79%     80% 
##  166.03  167.28  168.83  170.84  172.95  175.61  178.47  181.69  185.07 
##     81%     82%     83%     84%     85%     86%     87%     88%     89% 
##  188.65  192.61  195.85  198.55  201.97  206.50  213.26  225.33  238.08 
##     90%     91%     92%     93%     94%     95%     96%     97%     98% 
##  242.31  245.23  248.55  253.99  263.07  276.53  307.71  324.37  333.91 
##     99%    100% 
##  354.43 3840.60
quantile(diamonds$volume, 0.999)
## 99.9% 
## 431.6
qplot(x = volume, y = price, color = I("orange"), alpha = I(1/20), data = subset(diamonds, 
    volume > 0 & volume <= quantile(diamonds$volume, 0.999))) + geom_smooth(method = "lm", 
    se = T, color = "blue")

plot of chunk Price_vs_Volume_Linear_Model

As the volume increases, the variance in price increases. That is, the data becomes more dispersed. The relationship does not look linear and appears more exponential, especially in the original plot of price vs. volume.

Linear Model

volumePriceLM <- lm(price ~ volume, data = subset(diamonds, volume > 0 & volume <= 
    quantile(diamonds$volume, 0.999)))
summary(volumePriceLM)
## 
## Call:
## lm(formula = price ~ volume, data = subset(diamonds, volume > 
##     0 & volume <= quantile(diamonds$volume, 0.999)))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -10739   -809    -12    565  12575 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.38e+03   1.29e+01    -185   <2e-16 ***
## volume       4.87e+01   8.59e-02     566   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1510 on 53864 degrees of freedom
## Multiple R-squared:  0.856,  Adjusted R-squared:  0.856 
## F-statistic: 3.21e+05 on 1 and 53864 DF,  p-value: <2e-16

Based on the R2 value, volume explains about 85 percent of the variance in price. Let's keep looking at other variables, including the categorical ones. The volume was a rough approximation though since the diamonds were assumed to be box shaped. Let's use carat and the density of diamonds to better approximate volume.

Any diamonds with carat value of 0?

table(diamonds$carat == 0)
## 
## FALSE 
## 53940

I'll also have volumes for diamonds that had missing x, y, or z dimensions since carat weight was reported for all diamonds.

Better Volume Approximation

1 carat is equivalent to 2 grams

Let's use the average density of diamonds to compute the volume. Using Google, I found that diamond density is typically 3.15-3.53 g/cm3 with pure diamonds having a density close to 3.52 g/cm3. I'm going to use the average density 3.34 g/cm3 to estimate the volume of the diamonds.

# create a volume from carat and density of diamonds
diamonds <- transform(diamonds, volume = carat * 2 * 3.34)

Price vs Better Volume

qplot(x = volume, y = price, data = diamonds)

plot of chunk Price_vs_Better_Volume

No volumes that are 0. Still have some outliers, but they are less extreme.

Price vs Volume colored by Clarity

qplot(y = price, x = volume, data = diamonds, color = clarity) + scale_color_brewer(type = "div") + 
    scale_y_log10()

plot of chunk Price_vs_Volume_and_Clarity


# There's the same outliers as before. Look at diamonds with volumes < 1000
# and > 0.

qplot(y = price, x = volume, data = subset(diamonds, volume < 1000 & volume > 
    0), color = clarity) + scale_color_brewer(type = "div") + scale_y_log10()

plot of chunk Price_vs_Volume_and_Clarity

Diamonds are priced higher for better clarity holding volume constant.

Price vs Volume colored by Cut

qplot(y = price, x = volume, data = subset(diamonds, volume < 1000 & volume > 
    0), color = cut) + scale_color_brewer(type = "div") + scale_y_log10()

plot of chunk Price_vs_Volume_and_Cut

We lose the pattern when coloring by cut.

Price vs Volume colored by Color

qplot(y = price, x = volume, data = subset(diamonds, volume < 1000 & volume > 
    0), color = color) + scale_color_brewer(type = "div") + scale_y_log10()

plot of chunk Price_vs_Volume_and_Color

Diamonds with better color tend to be priced higher holding volume constant. This trend is not as clear or stong when the points were colored by clarity, but the trend is still present.

Create a new function to transform the carat variable

The relationship between price and carat seemed exponential. Let's use the cube root of carat weight to further explore the relationship of price and carat. This seems reasonable given that carat weight is a function of volume and the volume is a function of a diamond's dimensions.

library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:memisc':
## 
##     percent
cuberoot_trans = function() trans_new("cuberoot", transform = function(x) x^(1/3), 
    inverse = function(x) x^3)

Log10 Price and Cube Root of Carat

ggplot(aes(carat, price), data = diamonds) + geom_point() + scale_x_continuous(trans = cuberoot_trans(), 
    limits = c(0.2, 3), breaks = c(0.2, 0.5, 1, 2, 3)) + scale_y_continuous(trans = log10_trans(), 
    limits = c(350, 15000), breaks = c(350, 1000, 5000, 10000, 15000)) + ggtitle("Price (log10) by Cube-Root of Carat")
## Warning: Removed 1683 rows containing missing values (geom_point).

plot of chunk Use_cuberoot_trans

Price vs Carat and Clarity

ggplot(aes(x = carat, y = price, color = clarity), data = diamonds) + geom_point(alpha = 0.5, 
    size = 1, position = "jitter") + scale_color_brewer(type = "div", guide = guide_legend(title = "Clarity", 
    reverse = T, override.aes = list(alpha = 1, size = 2))) + scale_x_continuous(trans = cuberoot_trans(), 
    limits = c(0.2, 3), breaks = c(0.2, 0.5, 1, 2, 3)) + scale_y_continuous(trans = log10_trans(), 
    limits = c(350, 15000), breaks = c(350, 1000, 5000, 10000, 15000)) + ggtitle("Price (log10) by Cube-Root of Carat and Clarity")
## Warning: Removed 1691 rows containing missing values (geom_point).

plot of chunk Price_vs_Carat_and_Clarity

Holding carat weight constant, diamonds with lower clarity are almost always cheaper than diamonds with better clarity (worst clarity is I1 and best clarity is IF).

Price vs Carat and Cut

ggplot(aes(x = carat, y = price, color = cut), data = diamonds) + geom_point(alpha = 0.5, 
    size = 1, position = "jitter") + scale_color_brewer(type = "div", guide = guide_legend(title = "Cut", 
    reverse = T, override.aes = list(alpha = 1, size = 2))) + scale_x_continuous(trans = cuberoot_trans(), 
    limits = c(0.2, 3), breaks = c(0.2, 0.5, 1, 2, 3)) + scale_y_continuous(trans = log10_trans(), 
    limits = c(350, 15000), breaks = c(350, 1000, 5000, 10000, 15000)) + ggtitle("Price (log10) by Cube-Root of Carat and Cut")
## Warning: Removed 1691 rows containing missing values (geom_point).

plot of chunk Price_vs_Carat_and_Cut

Price does not vary as much on cut holding carat constant; the pattern is not noticeable here.

Price vs Carat and Color

ggplot(aes(x = carat, y = price, color = color), data = diamonds) + geom_point(alpha = 0.5, 
    size = 1, position = "jitter") + scale_color_brewer(type = "div", guide = guide_legend(title = "Color", 
    override.aes = list(alpha = 1, size = 2))) + scale_x_continuous(trans = cuberoot_trans(), 
    limits = c(0.2, 3), breaks = c(0.2, 0.5, 1, 2, 3)) + scale_y_continuous(trans = log10_trans(), 
    limits = c(350, 15000), breaks = c(350, 1000, 5000, 10000, 15000)) + ggtitle("Price (log10) by Cube-Root of Carat and Color")
## Warning: Removed 1691 rows containing missing values (geom_point).

plot of chunk Price_vs_Carat_and_Color

Color does seem to explain some of the variance in price as was the case with the clarity variable.

Building a Linear Model

The last 3 plots suggest that we can build a linear model and use those variables in the linear model to predict the price of a diamond.

m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
m2 <- update(m1, ~. + carat)
m3 <- update(m2, ~. + clarity)
m4 <- update(m3, ~. + cut)
m5 <- update(m4, ~. + color)
mtable(m1, m2, m3, m4, m5)
## 
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamonds)
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + clarity, 
##     data = diamonds)
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + clarity + 
##     cut, data = diamonds)
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + clarity + 
##     cut + color, data = diamonds)
## 
## ======================================================================
##                     m1         m2         m3         m4         m5    
## ----------------------------------------------------------------------
## (Intercept)      2.821***   1.039***   0.464***   0.391***   0.415*** 
##                 (0.006)    (0.019)    (0.014)    (0.014)    (0.010)   
## I(carat^(1/3))   5.558***   8.568***   9.319***   9.376***   9.144*** 
##                 (0.007)    (0.032)    (0.023)    (0.023)    (0.016)   
## carat                      -1.137***  -1.260***  -1.274***  -1.093*** 
##                            (0.012)    (0.008)    (0.008)    (0.006)   
## clarity: .L                            0.889***   0.854***   0.907*** 
##                                       (0.005)    (0.005)    (0.003)   
## clarity: .Q                           -0.255***  -0.239***  -0.240*** 
##                                       (0.005)    (0.005)    (0.003)   
## clarity: .C                            0.143***   0.129***   0.131*** 
##                                       (0.004)    (0.004)    (0.003)   
## clarity: ^4                           -0.086***  -0.080***  -0.063*** 
##                                       (0.003)    (0.003)    (0.002)   
## clarity: ^5                            0.038***   0.034***   0.026*** 
##                                       (0.003)    (0.003)    (0.002)   
## clarity: ^6                            0.001      0.004     -0.002    
##                                       (0.002)    (0.002)    (0.002)   
## clarity: ^7                            0.054***   0.051***   0.032*** 
##                                       (0.002)    (0.002)    (0.001)   
## cut: .L                                           0.125***   0.120*** 
##                                                  (0.003)    (0.002)   
## cut: .Q                                          -0.034***  -0.031*** 
##                                                  (0.003)    (0.002)   
## cut: .C                                           0.016***   0.014*** 
##                                                  (0.002)    (0.002)   
## cut: ^4                                          -0.001     -0.002    
##                                                  (0.002)    (0.001)   
## color: .L                                                   -0.441*** 
##                                                             (0.002)   
## color: .Q                                                   -0.093*** 
##                                                             (0.002)   
## color: .C                                                   -0.013*** 
##                                                             (0.002)   
## color: ^4                                                    0.012*** 
##                                                             (0.002)   
## color: ^5                                                   -0.003*   
##                                                             (0.001)   
## color: ^6                                                    0.001    
##                                                             (0.001)   
## ----------------------------------------------------------------------
## R-squared            0.924      0.935      0.967      0.968      0.984
## adj. R-squared       0.924      0.935      0.967      0.968      0.984
## sigma                0.280      0.259      0.185      0.181      0.129
## F               652012.063 387489.366 175093.345 125821.403 173791.084
## p                    0.000      0.000      0.000      0.000      0.000
## Log-likelihood   -7962.499  -3631.319  14605.945  15580.358  34091.272
## Deviance          4242.831   3613.360   1837.549   1772.344    892.214
## AIC              15930.999   7270.637 -29189.890 -31130.717 -68140.544
## BIC              15957.685   7306.220 -29092.038 -30997.282 -67953.736
## N                53940      53940      53940      53940      53940    
## ======================================================================

The variables in this linear model can account for 98.4% of the variance in the price of diamonds. The addition of the cut variable to the model slightly improves the R2 value by one tenth of a percent, which is expected based on the visualization above of Log10 Price vs. Cube-Root Carat and Cut.


Final Plots and Summary

Final Plot One

qplot(price, data = diamonds, binwidth = 0.01, fill = I("#099DD9")) + scale_x_log10(breaks = c(1000, 
    1500, 10000)) + ggtitle("Log10 Price") + xlab("Price (in dollars)") + ylab("Number of Diamonds")
## Warning: position_stack requires constant width: output may be incorrect

plot of chunk Final_Plot_One

The distribution of diamond prices appears to be bimodal, perhaps due to the demand of diamonds and buyers purchasing in two different ranges of price points.

Final Plot Two

library(gridExtra)
## Loading required package: grid

plot1 <- qplot(x = clarity, y = price, data = diamonds, geom = "boxplot", fill = clarity) + 
    ggtitle("Diamond Prices by Cut") + xlab("Clarity") + ylab("Price (in dollars)") + 
    coord_cartesian(ylim = c(0, 7000)) + theme(plot.title = element_text(size = 16), 
    legend.position = "none")

plot2 <- qplot(price, data = diamonds, binwidth = 0.01, color = clarity, geom = "density") + 
    scale_x_log10(breaks = c(1000, 2000, 3000, 4500, 7000, 10000)) + guides(color = guide_legend(reverse = T)) + 
    xlab("Price/Carat ($/ct)") + ylab("Density") + ggtitle("Density of Price/Carat by Clarity") + 
    theme(plot.title = element_text(size = 16))

grid.arrange(plot1, plot2, ncol = 1)

plot of chunk Final_Plot_Two

Diamonds with the best level of clarity (IF) have the lowest median price. A greater proportion of diamonds with the best clarity are priced lowered compared to the proportion of diamonds in other price distributions for worse levels of clarity. Price variance increases as the clarity improves (worst clarity is I1).

Final Plot Three

ggplot(aes(x = carat, y = price, color = clarity), data = diamonds) + geom_point(alpha = 0.5, 
    size = 1, position = "jitter") + scale_color_brewer(type = "div", guide = guide_legend(title = "Clarity", 
    reverse = T, override.aes = list(alpha = 1, size = 2))) + scale_x_continuous(trans = cuberoot_trans(), 
    limits = c(0.2, 3), breaks = c(0.2, 0.5, 1, 2, 3)) + scale_y_continuous(trans = log10_trans(), 
    limits = c(350, 15000), breaks = c(350, 1000, 5000, 10000, 15000)) + ggtitle("Price (log10) by Cube-Root of Carat and Clarity") + 
    theme(plot.title = element_text(size = 16))
## Warning: Removed 1692 rows containing missing values (geom_point).

plot of chunk Final_Plot_Three

Holding carat weight constant, diamonds with higher clarity levels (I1 is worst and IF is best) are almost always cheaper than diamonds with better clarity. The plot indicates that a linear model could be constructed to predict the price of variables using log10(price) as the outcome variable and cube-root of carat as the predictor variable


Reflection

The diamonds data set contains information on almost 54,000 thousand diamonds from around 2008. I started by understanding the individual variables in the data set, and then I explored interesting questions and leads as I continued to make observations on plots. Eventually, I explored the price of diamonds across many variables and created a linear model to predict diamond prices. I was surprised that depth or table did not have a strong positive correlation with price, but these variables are likely to be represented by categorical variables: color, cut, and clarity. I struggled understanding the decrease in median price as the level of cut and clarity improved, but this became more clear when I realized that most of the data contained ideal cut diamonds. For the linear model, all diamonds were included since information on price, carat, color, clarity, and cut were available for all the diamonds. Some limitations of this model include the source of the data. Given that the diamonds date to 2008, the model would likely undervalue diamonds in the market today, either due to changes in demand and supply or inflation rates. To investigat this data further, I would examine how values of 0 were introduced into the data set for the variables volume, x, y, and z. I would be interested in testing the linear model to predict current diamond prices and to determine to what extent the model is accurate at pricing diamonds. A more recent data would be better to make predictions of diamond prices, and comparisons might be made between the other linear models to see if other variables may account for diamond prices.